Patch cuda dev tools (#736) #743
Conversation
Thanks for working on this!

(Referencing rocker-versioned2/scripts/install_R_source.sh, lines 98 to 139 at 6efce3a.)
Thanks @eitsupi ! Yes, great point, that does sound like a much simpler solution! I'll scratch this and take a go at that instead.
@eitsupi okay, looping back to this using your simpler mechanism. Can you take a look over this when you get a chance? Thanks!
I'm sorry my comment was confusing.
scripts/install_R_source.sh (Outdated)

```bash
@@ -157,7 +158,9 @@ rm -rf "R.tar.gz"
cp /usr/bin/checkbashisms /usr/local/bin/checkbashisms

# shellcheck disable=SC2086
apt-get remove --purge -y ${BUILDDEPS}
if [ "${PURGE_BUILDDEPS}" == "TRUE" ]; then
```
@eitsupi you said:

> it would be better to use the same script with the flag added to install_R_source.sh and skipped the "build R" steps so that we don't have to have almost identical copies in two files.

I am adding a flag to install_R_source.sh right here, though. What two files are you referring to as almost identical copies?
Perhaps my assumptions didn't come across. I had imagined something like the following:

```dockerfile
FROM rocker/r-ver:4.3.2 as builder

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

# copy the compiled R out of the builder stage (schematic path)
COPY --from=builder /foo/bar /foo/bar

COPY scripts/install_R_source.sh /rocker_scripts/install_R_source.sh

# should install the necessary packages, but should skip building R
ENV SKIP_BUILD_R="true"
RUN /rocker_scripts/install_R_source.sh
```
Thanks, I see what you mean. That looks rather more complicated though.

- It requires using a multi-stage build, and I'm not sure how easy that is to add to our existing build stack.
- Copying `/foo/bar` isn't so clean, because our standard build installs R_HOME into /usr/local, with binaries in /usr/local/bin, etc., rather than freestanding. Yes, I proposed a solution around this in my previous proposal, but it is a significant change that could break some user expectations.
- `SKIP_BUILD_R` would have to wrap around the install-time deps, the building of R, and the cleanup and purge of dependencies. That's fine, but it feels more natural to me to have an option that just avoids the `apt remove` step (see the sketch after this list). After all, given that `apt remove` has unexpected consequences, it seems reasonable to be able to opt out of that without opting out of the rest. And intuitively, I think it's easier to guess what `install_R_source.sh` does in response to an option of `PURGE_BUILDDEPS=FALSE` than to guess what it does in response to an option of `SKIP_BUILD_R`.

What's the downside of the approach here? I still don't see what's been duplicated.
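For reference, the gated purge in the diff above amounts to something like this (a sketch; the surrounding script lines are abbreviated):

```bash
# Default to purging, which preserves the current behaviour;
# setting PURGE_BUILDDEPS=FALSE opts out of the apt purge step.
PURGE_BUILDDEPS=${PURGE_BUILDDEPS:-"TRUE"}

if [ "${PURGE_BUILDDEPS}" = "TRUE" ]; then
    # shellcheck disable=SC2086
    apt-get remove --purge -y ${BUILDDEPS}
fi
```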
Apologies about my lint errors; looks like a whitespace issue but I'm not sure. Happy to change strategies again if that's better. I think adding the option of simply turning off the uninstall of builddeps is the least invasive way to address this issue with minimal changes. Please let me know if there's a better alternative, but I think I'd rather avoid introducing staged builds merely to address the problems created by the purge step.

I did slip a few other changes in here that look like they got missed in #740 (that PR updated the necessary env vars for most but not all recipes that use the revised install_python.sh script, but I think we want that on cuda images too). This also corrects a small permissions issue in install_python.sh. Let me know if it would be better to do these bits as a separate PR; I just thought it made sense to fix them while I was testing these cuda images anyhow. Thanks for the review and suggestions!
```diff
@@ -258,7 +258,7 @@
     "PYTHON_CONFIGURE_OPTS": "--enable-shared",
     "RETICULATE_AUTOCONFIGURE": "0",
     "PURGE_BUILDDEPS": "FALSE",
-    "VIRUTAL_ENV": "/opt/venv",
+    "VIRTUAL_ENV": "/opt/venv",
```
Hah!
Indeed! It will be nice to have /opt/venv work on these images.
@cboettig I understood. Indeed, I think this simple solution can be implemented relatively quickly and is worthwhile. By the way, the lint error looks like an indent size issue.
scripts/install_R_source.sh (Outdated)

```bash
@@ -11,6 +11,7 @@
set -e

R_VERSION=${1:-${R_VERSION:-"latest"}}
PURGE_BUILDDEPS=${PURGE_BUILDDEPS=-"TRUE"}
```
I think bools in env vars should be `"true"` or `"false"`, not `"TRUE"` or `"FALSE"`.
I think env vars are untyped so 1 and "1" and "TRUE" and "True" and "true" are all the same. Unless, of course, the receiving end makes a difference when it parses. Otherwise the check is commonly 'set or not set'.
Is that a common practice in shell scripts?

I saw in bash:

```bash
$ if "true"; then echo "true"; fi
true
$ if "false"; then echo "true"; fi
$ if "1"; then echo "true"; fi
bash: 1: command not found
$ if "TRUE"; then echo "true"; fi
bash: TRUE: command not found
$ if "True"; then echo "true"; fi
bash: True: command not found
```
Right, but as you saw, I did this as a string comparison since booleans in bash always confuse me. Anyway, I've changed these to lower-case now. Thanks for catching this! Let me know if you'd recommend further changes here.
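A case-insensitive comparison would also be possible if we wanted to accept any spelling; a minimal sketch, not what the script currently does:

```bash
# Normalize the value so TRUE, True, and true are all treated alike.
purge_flag=$(printf '%s' "${PURGE_BUILDDEPS:-true}" | tr '[:upper:]' '[:lower:]')
if [ "${purge_flag}" = "true" ]; then
    echo "build dependencies will be purged"
fi
```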
stacks/devel.json (Outdated)

```diff
@@ -148,15 +148,17 @@
     "org.opencontainers.image.title": "rocker/cuda",
     "org.opencontainers.image.description": "NVIDIA CUDA libraries added to Rocker image."
   },
-  "FROM": "nvidia/cuda:11.8.0-cudnn8-devel-ubuntu24.04",
+  "FROM": "nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04",
```
If I remember correctly, this value is updated by the script, so don't update it here manually.
I wondered about this when the script added it. Are we sure (a) the production cycle for Ubuntu 24.04 has started and (b) that NVidia already publishes to it? I was under the impression NVidia was particularly 'laggy', i.e. the gcc version implied by nvcc was always ages behind.
There is no `nvidia/cuda:11.8.0-cudnn8-devel-ubuntu24.04`. There are:

- `nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04`
- `nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04`
- `nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04` (EOL)

See https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/11.8.0
Thanks all. @eitsupi can you clarify? If this field is not read because it is provided by the script then maybe we can drop it or stick in a dummy value? It seems needlessly confusing to reference a value that isn't going to be used and isn't actually a valid image name but just happens to look almost like a valid image name, right?
What I meant is not that it is "not read", but rather "don't set it manually, because if the script is executed by CI, it will be updated".

The priority here is to use the same Ubuntu version as `rocker/r-ver`. I haven't found an easy way to check if that image actually exists (this is definitely possible using buildx, but I couldn't figure out how to check without logging into the Container Registry), so no such processing is implemented. If you try to build it, it will simply fail.

The current pointer to Ubuntu 24.04 is not as intended; I really wanted it to be 22.04, which is the same as the current `ubuntu:latest`. In other words, this is a bug in the script. (The 24.04 release date is later than today, so it should be filtered out.)
rocker-versioned2/stacks/devel.json, line 26 at 7d08837:

```json
"FROM": "ubuntu:latest",
```

rocker-versioned2/build/make-stacks.R, lines 449 to 450 at 7d08837:

```r
# Update the cuda base image
template$stack[[9]]$FROM <- .cuda_baseimage_tag(dplyr::last(df_ubuntu_lts$ubuntu_series))
```
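As an aside, one possible way to test whether a public tag exists without logging in (just a sketch, not something the build scripts currently do) is `docker manifest inspect`, which fails for missing tags:

```bash
# Exits 0 if the tag exists on the registry, non-zero otherwise;
# works for public Docker Hub images without authentication.
tag="nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04"
if docker manifest inspect "${tag}" >/dev/null 2>&1; then
    echo "${tag} exists"
else
    echo "${tag} not found"
fi
```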
Right, so it now reads 22.04, and I think it should work as intended. Let me know if anything needs to be changed?
> Right, so it now reads 22.04, and I think it should work as intended. Let me know if anything needs to be changed?

Run the scripts to update the files; the 24.04 value should come back.
rocker-versioned2/build/README.md, lines 5 to 12 at 7d08837:

> ## Automatic update of container definition files
>
> The Dockerfiles for pre-built images will be generated by executing the scripts in the following order. It will also generate the `docker-bake.json` files that can be used to build the container images. `make-matrix.R` will generate files containing the matrix used in GitHub Actions.
>
> 1. `./build/make-stacks.R`
> 2. `./build/make-dockerfiles.R`
> 3. `./build/make-bakejson.R`
> 4. `./build/make-matrix.R`
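Concretely, regenerating the files locally would look something like this (a sketch; assumes R and the build scripts' dependencies are installed):

```bash
# Run the four build scripts in the documented order.
Rscript ./build/make-stacks.R
Rscript ./build/make-dockerfiles.R
Rscript ./build/make-bakejson.R
Rscript ./build/make-matrix.R
```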
@eitsupi ok, I see what you mean: `make-stacks.R` tries to figure this out automatically, and it's coming up with the nonsense answer of 24.04 on the devel image. I still think that's confusing (doesn't that mean we're just failing to build a devel flavor for the whole cuda stack?), but I agree it's a separate issue and not one that needs to be solved in this PR.

I've just pushed an edit to roll that back.

I thought I fixed the whitespace but am still struggling with the linter :-(
@eitsupi any further thoughts or edits you'd like to see here before we merge?
Don't you need to edit the other stack files? Don't forget to also run the other build scripts. The linter error indicates inconsistent indentation: the indent size elsewhere is 4, but the line you added uses 2.
Thanks @eitsupi ! I've fixed the whitespace issue with linting now and run the remaining scripts. I'd like to at least start with these changes only in the active images (devel and latest, though it looks like the scripts automatically back-port this to the most recent locked release as well). Traditionally we have allowed images to continue to evolve at the front end while trying to keep old versions as stable and locked as possible, at least while still working out features. In some cases (maybe including this one) we might want to port this to all 4.x releases, but maybe we can revisit that once this is merged and working?
As you are aware, these scripts run automatically once a day on the main branch, so in many cases there is no problem even if no changes are made in the PR.
> I'd like to at least start with these changes only in the active images (devel and latest, though it looks like the scripts automatically back-port this to the most recent locked release as well). Traditionally we have allowed images to continue to evolve at the front end while trying to keep old versions as stable and locked as possible, at least while still working out features. In some cases (maybe including this one) we might want to port this to all 4.x releases, but maybe we can revisit that once this is merged and working?

Makes sense. Let's try merging.
I don't know what the CI failures indicate; TeX-related stuff breaks a lot, so I don't know if this is temporary or not. I don't think it has anything to do with this PR.
The tex error confuses me too. Like you say, it's always a fragile point. Locally that test has no problems for me.
```bash
@@ -11,6 +11,7 @@
set -e

R_VERSION=${1:-${R_VERSION:-"latest"}}
PURGE_BUILDDEPS=${PURGE_BUILDDEPS=-"true"}
```
Maybe a typo? Should this be:

```bash
PURGE_BUILDDEPS=${PURGE_BUILDDEPS:-"true"}
```
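For the record, `=-` parses as the `=` (assign-default) operator with the word `-true`, so an unset variable would be set to `-true` rather than `true`:

```bash
unset PURGE_BUILDDEPS
echo "${PURGE_BUILDDEPS=-true}"   # prints "-true" (and assigns it)

unset PURGE_BUILDDEPS
echo "${PURGE_BUILDDEPS:-true}"   # prints "true" (substitutes the default)
```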
@eitsupi my apologies, I didn't realize my local test was pulling the upstream image. The tex issue is specific to the changes in #743: the official ubuntu repo texlive binaries are installed and then purged via the BUILDDEPS list, so by not purging builddeps we had conflicting versions of texlive. This PR modifies `install_texlive.sh` to skip the manual install of tlmgr from CTAN if texlive has already been installed by apt-get (i.e., `/usr/bin/latex` already exists). I believe it should resolve the issue.
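The guard described above might look roughly like this (a sketch, not the exact `install_texlive.sh` change):

```bash
# If a TeX distribution is already installed (e.g. via apt-get),
# skip the manual TeX Live installation entirely.
if [ -x /usr/bin/latex ]; then
    echo "TeX already installed; skipping manual TeX Live install."
    exit 0
fi
```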
Just a draft proposal here so far, but I think it's the least-friction way to do this. It doesn't impact the normal r-ver stack. In the cuda stack, the idea would be to copy the compiled version of R from r-ver rather than run the typical `install_R_source.sh` script, which creates the problems in purging dependencies. To make this work the runtime dependencies still need to be installed, so for simplicity I've added these in another script. (Technically that means we have these dependencies specified in two places now, so I could adjust `install_R_source.sh` to source the new script instead, as sketched below; not sure if that's preferable?)

I haven't taken a go at defining the new bakefile template yet. Hopefully straightforward, though I will need to make sure it gets the right version tag for r-ver images in the non-latest images; I haven't thought out the logic there.

@eitsupi thoughts?
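For the two-places concern, the source-the-script route might look roughly like this inside `install_R_source.sh` (the script name below is hypothetical):

```bash
# Install the shared runtime dependencies from the new script
# (install_R_runtime_deps.sh is a hypothetical name), then continue
# with the build-only dependencies and the R build itself.
. /rocker_scripts/install_R_runtime_deps.sh
```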